
    How to Do a Million Watchpoints: Efficient Debugging Using Dynamic Instrumentation

    Application debugging is a tedious but inevitable chore in any software development project. An effective debugger can make programmers more productive by allowing them to pause execution and inspect the state of the process, or to monitor writes to memory to detect data corruption. The latter is a notoriously difficult category of bugs to diagnose and repair, especially in pointer-heavy applications. The debugging challenges will only increase with the arrival of multicore processors, which require explicit parallelization of user code to obtain any performance gains. Parallelization, in turn, can lead to further data-related debugging issues, such as detecting data races between threads. This paper leverages the increasing efficiency of runtime binary interpreters to provide a new concept: Efficient Debugging using Dynamic Instrumentation, or EDDI. The paper demonstrates for the first time the feasibility of using dynamic instrumentation on demand to accelerate software debuggers, especially when the available hardware support is lacking or inadequate. As an example, EDDI can simultaneously monitor millions of memory locations without crippling the host processing platform. It does this in software and hence provides a portable debugging environment. It is also well suited for interactive debugging because of its low overheads. EDDI provides a scalable and extensible debugging framework that can substantially increase the feature set of standard off-the-shelf debuggers. Singapore-MIT Alliance (SMA)
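
    To make the watchpoint idea concrete, the following C sketch shows the kind of check a dynamic instrumentation system could insert before every store: the written address is probed against a large open-addressed table of watched locations. This is a minimal illustration under our own assumptions; the names (watch_add, check_store) are hypothetical, and EDDI's actual mechanism is not reproduced here.

        /* Hypothetical sketch: a software watchpoint check that an
         * instrumentation tool might insert before each store.
         * Not EDDI's actual implementation. */
        #include <stdint.h>
        #include <stdio.h>

        #define TABLE_BITS 21                     /* ~2M slots: headroom for a million watchpoints */
        #define TABLE_SIZE (1u << TABLE_BITS)

        static uintptr_t watch_table[TABLE_SIZE]; /* open-addressed set; 0 means empty */

        static size_t slot_of(uintptr_t a) {
            return (size_t)(a * 2654435761u) & (TABLE_SIZE - 1);
        }

        static void watch_add(void *addr) {
            size_t i = slot_of((uintptr_t)addr);
            while (watch_table[i] != 0)           /* linear probing */
                i = (i + 1) & (TABLE_SIZE - 1);
            watch_table[i] = (uintptr_t)addr;
        }

        /* Instrumentation would invoke this before every store instruction. */
        static void check_store(void *addr) {
            size_t i = slot_of((uintptr_t)addr);
            while (watch_table[i] != 0) {
                if (watch_table[i] == (uintptr_t)addr) {
                    fprintf(stderr, "watchpoint hit: write to %p\n", addr);
                    return;
                }
                i = (i + 1) & (TABLE_SIZE - 1);
            }
        }

        int main(void) {
            static int guarded;
            watch_add(&guarded);
            check_store(&guarded);                /* simulated instrumented store */
            guarded = 42;
            return 0;
        }

    A real tool would also need deletion and overflow handling; the point is only that the per-store check is a handful of instructions, which is why instrumentation-based watchpoints can plausibly scale to millions of locations.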

    Data remapping for design space optimization of embedded memory systems

    In this paper, we present a novel linear-time algorithm for data remapping that is (i) lightweight, (ii) fully automated, and (iii) applicable in the context of pointer-centric programming languages with dynamic memory allocation support. All previous work in this area lacks one or more of these features. We go on to demonstrate a novel application of this algorithm as a key step in optimizing the design of an embedded memory system. Specifically, we show that by virtue of locality enhancements via data remapping, we can reduce the memory subsystem needs of an application by 50%, and hence concomitantly reduce the associated costs in size, power, and dollar investment (61%). Such a reduction overcomes key hurdles in designing high-performance embedded computing solutions: memory subsystems are very desirable from a performance standpoint, but their costs have often limited their use in embedded systems. Thus, our approach offers the intriguing possibility of compilers playing a significant role in exploring and optimizing the design space of a memory subsystem for an embedded design. To this end, and in order to properly leverage the improvements afforded by a compiler optimization, we identify a range of measures for quantifying the cost impact of popular notions of locality, prefetching, regularity of memory access, and others. The proposed methodology will become…
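
    As a rough illustration of why remapping helps locality (a generic example of ours, not the paper's linear-time algorithm), consider copying the nodes of a pointer-linked list, which malloc may have scattered across the heap, into one contiguous block so that a traversal walks sequential cache lines; the names below (Node, remap_list) are invented for this sketch.

        /* Illustrative pool-based remapping of a linked list; the paper's
         * automated, linear-time remapping algorithm is not shown here. */
        #include <stdio.h>
        #include <stdlib.h>

        typedef struct Node { int val; struct Node *next; } Node;

        /* Copy an n-node list into one contiguous block, preserving order. */
        static Node *remap_list(const Node *head, size_t n) {
            Node *pool = malloc(n * sizeof *pool);
            size_t i = 0;
            for (const Node *p = head; p && i < n; p = p->next, i++) {
                pool[i].val  = p->val;
                pool[i].next = (i + 1 < n) ? &pool[i + 1] : NULL;
            }
            return pool;          /* traversal now touches sequential memory */
        }

        int main(void) {
            Node *head = NULL;    /* build 0->1->2->3->4 from scattered nodes */
            for (int v = 4; v >= 0; v--) {
                Node *p = malloc(sizeof *p);
                p->val = v; p->next = head; head = p;
            }
            Node *dense = remap_list(head, 5);
            for (Node *p = dense; p; p = p->next) printf("%d ", p->val);
            printf("\n");
            return 0;
        }

    A denser layout means fewer distinct cache lines and pages touched per traversal, which is the kind of effect that lets a smaller (and cheaper) memory subsystem sustain acceptable performance.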

    Versatility and VersaBench: A New Metric and a Benchmark Suite for Flexible Architectures

    For the last several decades, computer architecture research has largely benefited from, and continues to be driven by, ad hoc benchmarking. Often the benchmarks are selected to represent workloads that architects believe should run on the computational platforms they design. For example, benchmark suites such as SPEC, Winstone, and MediaBench, which represent workstation, desktop, and media workloads respectively, have influenced computer architecture innovation for the last decade. Recently, advances in VLSI technology have created increasing interest within the computer architecture community in building a new kind of processor that is more flexible than extant general-purpose processors. Such new processor architectures must efficiently support a broad class of applications, including graphics, networking, and signal processing, in addition to traditional desktop workloads. Given this new focus on flexibility, a new benchmark suite and new metrics are necessary to accurately reflect the goals of the architecture community. This paper therefore proposes VersaBench, a new benchmark suite, and a new Versatility measure to characterize architectural flexibility, that is, the ability of an architecture to effectively execute a wide array of workloads. The benchmark suite is composed of applications drawn from several domains, including desktop, server, stream, and bit-level processing. The Versatility measure is a single scalar metric inspired by the SPEC paradigm: it normalizes processor performance on each benchmark by that of the highest-performing machine for that application. This paper reports the measured versatility for several existing processors, as well as for some new and emerging research processors. The benchmark suite is freely distributed, and we are actively cataloging and sharing results for various reference processors.
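
    Taking the abstract's description literally, the Versatility of a machine m over B benchmarks normalizes its performance on each benchmark by the best machine for that benchmark. Aggregating the ratios with a geometric mean follows the SPEC convention the authors invoke, but the abstract does not spell out the aggregation, so treat this formula as our reading rather than the paper's definition:

        V(m) \;=\; \left( \prod_{b=1}^{B} \frac{\mathrm{perf}(m,\, b)}{\max_{m'} \mathrm{perf}(m',\, b)} \right)^{1/B}

    Here perf(m, b) is the performance of machine m on benchmark b. Each factor is at most 1, so a machine that is best on every benchmark would score V = 1, while specializing in a few domains at the expense of others pulls the score down.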

    A Productive Programming Environment for Stream Computing

    This paper presents StreamIt and the StreamIt Development Tool. The development tool is an IDE designed to improve the coding, debugging, and visualization of streaming applications by exploiting the ability of the StreamIt language to naturally represent streaming programs as structured, hierarchical graphs. The StreamIt Development Tool aims to emulate the best of traditional debuggers and IDEs while moving toward hierarchical visualization and debugging concepts specialized for streaming applications. As such, it provides utilities for stream graph examination, tracking of data flow between streams, and deterministic execution of parallel streams. These features complement more conventional facilities for creating and editing code, integrated compilation, breakpoints, watchpoints, and step-by-step program execution. A user study evaluating StreamIt and the development tool was held at MIT, during which participants were given erroneous programs and asked to resolve the errors. We compared the productivity of users who had the StreamIt Development Tool and its graphical features against users restricted to line-oriented debugging strategies, and found that the former produced ten more correct solutions than the latter. Furthermore, our data suggest that the graphical tool chain helped mitigate user frustration and encouraged participants to invest more time tracking down and fixing programming errors.
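
    The structured, hierarchical graphs the abstract refers to are pipelines (and split-joins) of filters with explicit data flow between stages. As a language-neutral sketch of that model (actual StreamIt syntax is not shown, and the names below are ours), here is a two-stage pipeline in C:

        /* Hedged sketch of the structured streaming model: filters
         * composed into a pipeline, with data flowing stage to stage.
         * This is not StreamIt code. */
        #include <stdio.h>

        typedef int (*filter_fn)(int);   /* a filter: consume one item, produce one item */

        static int doubler(int x) { return 2 * x; }
        static int add_one(int x) { return x + 1; }

        /* Push every input item through the stages, in order. */
        static void run_pipeline(const int *in, int *out, int n,
                                 filter_fn *stages, int n_stages) {
            for (int i = 0; i < n; i++) {
                int v = in[i];
                for (int s = 0; s < n_stages; s++)
                    v = stages[s](v);
                out[i] = v;
            }
        }

        int main(void) {
            int in[] = {1, 2, 3}, out[3];
            filter_fn stages[] = {doubler, add_one};
            run_pipeline(in, out, 3, stages, 2);
            for (int i = 0; i < 3; i++) printf("%d ", out[i]);  /* 3 5 7 */
            printf("\n");
            return 0;
        }

    Because each stage and its connections are explicit, a tool can draw the graph, trace items between stages, and replay execution deterministically, which is the structure the StreamIt Development Tool exploits.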

    Compiler orchestrated prefetching via speculation and predication

    In: 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XI), pp. 189-19…

    Adaptive compiler-directed prefetching for EPIC processors

    The widely acknowledged performance gap between processors and memory has been the subject of much research. In the Explicitly Parallel Instruction Computing (EPIC) paradigm, the combination of in-order issue and a large number of parallel functional units exacerbates the problem. Prefetching, by hardware, software, or a combination of both, is one of the primary mechanisms advocated to alleviate it. In this paper, we propose a new software-based data prefetching mechanism, the Adaptive Markovian Predictor (AMP). AMP is suitable for implementation in EPIC processors without significant hardware overhead. Specifically, we introduce a predicated prefetch operation that leverages the concept of an informing load to adapt dynamically to runtime memory behavior. Furthermore, we employ predicated prefetching in a new optimization framework that also incorporates data remapping and off-line learning of Markovian predictors. This distinguishes our approach from earlier software prefetching techniques that involve only static program analysis. Our experiments show that the proposed framework can effectively remove 10-30% of the stall cycles due to cache misses for benchmarks from the well-known SPEC and OLDEN suites. The results also show that the framework outperforms pure stride predictors while incurring lower bandwidth and instruction overheads.
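
    To illustrate the Markovian part of the framework (a generic first-order predictor of our own devising, not the paper's trained, predicated EPIC implementation), the C sketch below records, for each miss address, the miss that followed it, and prefetches the predicted successor the next time that address misses. Informing loads and predicated prefetch instructions are hardware features and are only stubbed here; all names are hypothetical.

        /* Hedged sketch of a first-order Markov miss-address predictor.
         * The "prefetch" is a stub standing in for a predicated
         * prefetch instruction. */
        #include <stdint.h>
        #include <stdio.h>

        #define ENTRIES 1024

        typedef struct {
            uintptr_t last_miss;   /* key: previous miss address */
            uintptr_t next_miss;   /* value: observed successor */
        } markov_entry;

        static markov_entry table[ENTRIES];
        static uintptr_t prev_miss;

        static unsigned slot(uintptr_t a) { return (a >> 6) & (ENTRIES - 1); }

        static void issue_prefetch(uintptr_t addr) {
            printf("prefetch %#lx\n", (unsigned long)addr);
        }

        /* Called on every cache miss, as an informing load would signal. */
        static void on_miss(uintptr_t addr) {
            table[slot(prev_miss)] = (markov_entry){ prev_miss, addr }; /* learn transition */
            markov_entry e = table[slot(addr)];
            if (e.last_miss == addr)
                issue_prefetch(e.next_miss);   /* predict and prefetch the next miss */
            prev_miss = addr;
        }

        int main(void) {
            uintptr_t trace[] = {0x1000, 0x2000, 0x1000, 0x2000};
            for (int i = 0; i < 4; i++) on_miss(trace[i]);
            return 0;
        }

    On the alternating miss trace in main, the predictor warms up after one period and then prefetches each successor correctly; real traffic would need confidence and bandwidth control, which is where the paper's predication and off-line training come in.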